{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 使用PAI预置算法微调模型\n",
    "\n",
    "预训练模型（pre-trained model）是通过在大规模数据集上进行训练，从而学习到数据的特征表示的深度学习模型。因为模型是通过大规模的数据进行预训练，因而可以通过少量的数据集进行训练，避免从头训练模型的高额成本。在应用时，预训练模型可以被作为基础模型，然后在特定任务的有标注数据集上进行微调，从而适应特定任务的要求。\n",
    "\n",
    "PAI在公共模型仓库中，提供了不同领域，包括计算机视觉、自然语言处理、语音等的常见热门预训练模型，例如 `QWen`、`Bert`、`ChatGLM`、`LLama2`、`StableDiffusion 2.1` 等，并提供了相应的预置算法，用户仅需提供数据集，即可在PAI上完成模型微调训练。\n",
    "\n",
    "在本示例中，我们将以`Bert`模型为示例，展示如何使用PAI提供的预置算法对模型进行微调训练。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## 安装和配置SDK\n",
    "\n",
    "\n",
    "我们需要首先安装PAI Python SDK以运行本示例。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "!python -m pip install --upgrade alipai"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "SDK需要配置访问阿里云服务需要的AccessKey，以及当前使用的工作空间和OSS Bucket。在PAI SDK安装之后，通过在**命令行终端** 中执行以下命令，按照引导配置密钥、工作空间等信息。\n",
    "\n",
    "\n",
    "```shell\n",
    "\n",
    "# 以下命令，请在 命令行终端 中执行.\n",
    "\n",
    "python -m pai.toolkit.config\n",
    "\n",
    "```\n",
    "\n",
    "我们可以通过以下代码验证配置是否已生效。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [],
   "source": [
    "import pai\n",
    "from pai.session import get_default_session\n",
    "\n",
    "print(pai.__version__)\n",
    "\n",
    "sess = get_default_session()\n",
    "\n",
    "# 获取配置的工作空间信息\n",
    "assert sess.workspace_name is not None\n",
    "print(sess.workspace_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 查看PAI提供的预训练模型\n",
    "\n",
    "我们可以通过参数`provider`为`pai`，获取`PAI`公共模型仓库下的模型，其中包含了PAI提供的模型和从开源社区精选的模型。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pai.model import RegisteredModel\n",
    "\n",
    "\n",
    "data = [[\"ModelName\", \"Task\", \"Revision\"]]\n",
    "\n",
    "# 获取公共模型仓库'pai'提供的模型列表\n",
    "for m in RegisteredModel.list(model_provider=\"pai\"):\n",
    "    revision = m.version_labels.get(\"revision\")\n",
    "    license = m.version_labels.get(\"license\")\n",
    "    task = m.task\n",
    "    data.append([m.model_name, task, revision])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython.display import HTML, display\n",
    "\n",
    "display(\n",
    "    HTML(\n",
    "        \"<table><tr>{}</tr></table>\".format(\n",
    "            \"</tr><tr>\".join(\n",
    "                \"<td>{}</td>\".format(\"</td><td>\".join(str(_) for _ in row))\n",
    "                for row in data\n",
    "            )\n",
    "        )\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 使用模型的预置算法微调训练\n",
    "\n",
    "通过`model_name`和`model_provider`参数，我们可以获取PAI提供的预训练模型(`RegisteredModel`对象)，`RegisteredModel`对象包含了模型所在的OSS Bucket信息，以及模型的预训练算法配置。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pai.model import RegisteredModel\n",
    "\n",
    "# 获取PAI模型仓库中的bert-base-uncased模型\n",
    "m = RegisteredModel(\n",
    "    model_name=\"bert-base-uncased\",\n",
    "    model_provider=\"pai\",\n",
    ")\n",
    "\n",
    "print(m)\n",
    "\n",
    "# 查看模型的公共读OSS Bucket路径\n",
    "print(m.model_data)\n",
    "# 查看模型的训练算法配置\n",
    "print(m.training_spec)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "获取 `bert-base-uncased` 模型的预置微调算法，以及算法的超参定义和输入数据定义。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "from pai.estimator import AlgorithmEstimator\n",
    "\n",
    "\n",
    "# 通过注册模型的配置，获取相应的预训练算法\n",
    "est: AlgorithmEstimator = m.get_estimator()\n",
    "\n",
    "# 查看算法的超参定义\n",
    "print(json.dumps(est.hyperparameter_definitions, indent=4))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 查看算法的输入输出通道定义\n",
    "print(json.dumps(est.input_channel_definitions, indent=4))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 默认的超参信息\n",
    "print(\"before\")\n",
    "print(est.hyperparameters)\n",
    "\n",
    "\n",
    "# 配置超参\n",
    "est.set_hyperparameters(batch_size=32)\n",
    "\n",
    "print(\"after\")\n",
    "print(est.hyperparameters)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "模型默认自带了测试的训练数据，例如BERT模型提供的预置算法可以用于训练一个文本分类模型，默认提供了[情感分类数据集sst2](https://huggingface.co/datasets/sst2)，可以直接用于模型的微调训练。\n",
    "训练数据格式为一个`jsonline`文件，每一行为一个json对象，包含了`label`和`text`两个字段，分别表示文本的标签和文本内容。\n",
    "\n",
    "```json\n",
    "{\"label\": \"negative\", \"text\": \"hide new secretions from the parental units \"}\n",
    "{\"label\": \"negative\", \"text\": \"contains no wit , only labored gags \"}\n",
    "{\"label\": \"positive\", \"text\": \"that loves its characters and communicates something rather beautiful about human nature \"}\n",
    "...\n",
    "...\n",
    "\n",
    "```\n",
    "\n",
    "用户可以参考以上的数据格式准备数据，当前示例中，我们将基于PAI提供的数据集对模型进行微调训练。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 获取模型自带的训练输入\n",
    "default_inputs = m.get_estimator_inputs()\n",
    "\n",
    "# 默认的算法训练输入，包含了算法使用的预训练模型，训练数据，以及验证数据。\n",
    "print(json.dumps(default_inputs, indent=4))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们将模型配置的默认的数据集作为训练输入，使用模型预置的PAI算法提交训练作业。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "est.fit(inputs=default_inputs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在训练结束之后获取产出模型的OSS Bucket路径。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 查看输出模型路径\n",
    "print(est.model_data())"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
